We are consulting for a real estate company whose business model includes a niche in rental property investment for short-term gains, specifically within New York City. The company has already concluded that two-bedroom properties are the most profitable. Now, it needs to find out which zip codes are the best to invest in, in terms of generating the most profit on short-term rentals within New York City.
The primary data sources, from both Zillow and Airbnb, are publicly available. For the cost data, Zillow provides an estimate of value for two-bedroom properties. For the revenue data, Airbnb is the platform through which the investor plans to lease out the investment property.
Our objective here is to help this real estate company identify the zip codes that can generate the most short-term return on the two-bedroom properties it invests in. Our main approach centers on exploratory analysis combined with business interpretation.
Explicitly, we define the success of this analytics project as providing the top zip codes in New York City that meet the client’s expectations. In addition to answering this specific business question, we should also clarify our exploratory logic and conduct the analysis in a scalable manner, so it can be applied to either new data or new scenarios in the future.
To achieve this goal, our analysis should help the company dive deep into the collected data for hidden insights by primarily focusing on:
1) deploying sophisticated data treatment to filter out the useful information;
2) interpreting the analysis results to decide which zip codes have high profitability.
Initially, there are four main assumptions provided by the client:
1. The occupancy rate is 75%.
2. All investment will be paid in cash at once with no mortgage, thus no concerns on interest rate.
3. The time value of money discount rate is 0%.
4. All properties and all square feet within each locale can be assumed to be homogeneous.
As the analysis progresses, we make additional assumptions in order to simplify the procedure without critically affecting the integrity of the results. These assumptions are mentioned in context where they apply and are also summarized as follows:
1. When calculating the Cap Rate, we use annual revenue instead of net operating income, ignoring expenses.
2. When using the Zillow dataset to predict the current cost of properties as of July 2019 (the latest date in the Airbnb dataset), we assume there is seasonality in the price, and also that values depend not only on previous values (autoregressive, AR) but also on differences between previous values (moving average, MA).
3. For properties that rent out the space as private rooms, we assume that both rooms are booked at the same time; thus the occupancy rate is the same as for a regular entire apartment.
4. When calculating revenue, the cleaning fee is not considered.
5. In addition to availability, we intuitively assume that more reviews and a longer listing time equate to popular properties and a popular neighbourhood/zipcode.
6. Property appreciation and other financial factors are not considered.
In this section, we have four parts:
1) Initial setup, including loading R libraries and reading datasets
2) Data treatment on the Airbnb dataset, including cleaning, munging, and filtering
3) Data treatment on the Zillow dataset, including cleaning, munging, and filtering
4) Data join, along with some simple feature engineering
In this part, to simplify the process of installing, updating, and loading all necessary R packages, we use an open-source package, lubripack. This package can load multiple packages at once and automatically install or update them when necessary.
# install this free devtool for automatic installing and loading packages
library(devtools)
install_github("espanta/lubripack")
library(lubripack)
# list all packages
pkgs <- c("data.table", "dplyr", "ggplot2", "tidyr", "naniar", "colorspace",
"forecast", "astsa", "Amelia", "scales", "GGally", "Matrix",
"plotly","maps","kableExtra","matrixStats","mice","caret","DataExplorer","processx")
# load all packages at once
lubripack(pkgs)

##
## Bellow Packages Successfully Installed:
##
## data.table dplyr ggplot2 tidyr naniar
## TRUE TRUE TRUE TRUE TRUE
## colorspace forecast astsa Amelia scales
## TRUE TRUE TRUE TRUE TRUE
## GGally Matrix plotly maps kableExtra
## TRUE TRUE TRUE TRUE TRUE
## matrixStats mice caret DataExplorer processx
## TRUE TRUE TRUE TRUE TRUE
AirBnB data: (Downloaded from http://data.insideairbnb.com/united-states/ny/new-york-city/2019-07-08/data/listings.csv.gz) This dataset provides revenue information for rental properties listed in the New York area. It was last updated in July 2019.
Zillow data: (Provided separately by the client) This Zillow Home Value Index (ZHVI) file is a time series dataset providing valuations for 2-bedroom properties across multiple states, including New York, over a wide time period from the 1990s to June 2017.
airbnb <- fread("listings.csv",header = TRUE, sep = "," ,
stringsAsFactors = TRUE, na.strings = c("","NA"))
zillow <- fread("Zip_Zhvi_2bedroom.csv",header = TRUE, sep = "," ,
stringsAsFactors = TRUE, na.strings = c("","NA"))

First, we should have a basic idea of the dimensions of these two datasets.
## [1] 48895 106
## [1] 8946 262
The Airbnb listings data contains 48895 rows and 106 columns, while the Zillow dataset contains 8946 rows and 262 columns.
First of all, we will conduct some data cleaning procedures separately before we proceed to extract only the data for 2-bedroom properties in NYC. Then we will build a user-defined function to help filter out useful and qualified columns. The entire cleaning procedure can also be reused in the future when newer Airbnb data becomes available.
As the results indicate (see Appendix 1), the majority of the description columns and host-related information are blank. Steps are taken in a later section to filter out columns with a high percentage of Nulls/NAs. We will focus on those columns that are critical for our analysis.
Unfortunately, the zipcode column contains 488 missing values. Ignoring these values can partially compromise the integrity of the analysis. Zipcodes are imputed by selecting a non-NA value from Neighbourhood Group Cleansed.
airbnb <-
airbnb %>% group_by(neighbourhood_group_cleansed) %>% fill(zipcode) %>% ungroup()
sum(is.na(airbnb$zipcode))

## [1] 0
Total Number of NA values in Zipcode Column after Imputation is 0.
None of the three price columns (Price, Weekly Price, and Monthly Price) should have a zero or negative value.
revCharCol <- colnames(airbnb %>% ungroup() %>% select_if(is.character))
revZeroNeg <-
sapply(airbnb[,!(names(airbnb) %in% revCharCol)], function(x)
count(x <= 0, na.rm = TRUE))
data.frame(revZeroNeg)

##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 4569 34985 6497 2069 525 147 42 21 8 5 3 1
## 14
## 1
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 1094 31104 10392 3635 1508 571 287 89 67 33 19 17
## 12 13 14 15 16 17 21 22 26 40
## 15 8 2 4 4 1 2 1 1 1
There are 0 and NA inputs in these three columns, but no negative input was found. However, it is unusual that 4569 properties have no bedroom and 1094 properties have no bed at all. It is possible that those properties have other kinds of beds, especially given that there are no missing values in the bed_type column.
##
## Airbed Couch Futon Pull-out Sofa Real Bed
## 170 73 295 261 48096
This result confirms our assumption. Thus we don’t need to pay extra attention to the number of bedrooms at this point.
Also, 11 entries in the price column have a non-positive value. Since this number is fairly small, we can drop them without affecting the integrity of the analysis.
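The dropping step itself is not shown here; on a hypothetical toy data frame (and assuming price has already been parsed from "$150.00"-style strings to numeric), it amounts to:

```r
# Toy sketch with made-up values: keep only listings with a strictly
# positive price, mirroring the drop described above
airbnb_toy <- data.frame(id = 1:5, price = c(150, 0, 89, -10, 200))
airbnb_toy <- airbnb_toy[airbnb_toy$price > 0, ]
nrow(airbnb_toy) # 3 listings remain
```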
First, we filter for the records that are within our business scope, to simplify the process and save computational power. In this case, those are the 2-bedroom properties in New York City.
Then, in order to remove columns that add little or no value to the analysis in a scalable way, some smart data munging techniques are incorporated. These include removing: 1) columns based on pattern matching with their names, 2) imbalanced columns, and 3) character columns with 100% variance.
For the main part of the data filtering, we use the associated methods to remove these columns from the Airbnb dataset and store the procedure as a user-defined function (UDF) for future use. After that, we manually exclude columns that contribute no significant value to further analysis.
Before proceeding to the next step, we keep only those records that are from New York City and have 2 bedrooms, given the business context. However, since there are many variations of NYC in the city column that would require much effort to eliminate, we can simply use state at this point as a preliminary filter.
## [1] NY Ny ny MP New York CA NJ
## [9] <NA>
## Levels: CA MP NJ NY New York Ny ny
Although the Airbnb dataset was itself generated with the criterion of being listed in New York City, it has many variations in the city column and even includes cities from other states.
state is chosen here for filtering rather than city because it is less likely to contain typos. Also, since we are going to merge the Airbnb data and the Zillow data, a preliminary filter here is sufficient to reduce the computational cost. More work can be done if needed as we proceed. The steps are listed here as a reference for other applications in the future.
nBedrooms = 2 #### can be changed ####
airbnb$state <- (gsub("Ny","NY",airbnb$state))
airbnb$state <- (gsub("ny","NY",airbnb$state))
airbnb$state <- (gsub("New York","NY",airbnb$state))
airbnb <- airbnb[which(airbnb$state=="NY" & airbnb$bedrooms == nBedrooms),]

Now, the airbnb dataset includes only those records within our business scope of 2-bedroom rental properties in New York City.
We create a cleanAirbnb function here as a scalable solution that removes columns that are not very useful in our analysis of zipcode profitability. For reference, all dropped columns are also listed. See Appendix 3 for details.
1. Zero Variance Cols
Imbalanced/zero-variance columns add no value to the analysis. These columns are removed using the nearZeroVar method from the caret package.
2. Pattern Matching
Column names starting with “require”, “host”, or “calendar” and ending with “url” or “nights” carry information that becomes irrelevant once the property is purchased by the real estate company.
3. Based on NA Values
Columns with over 60% NA values are removed without mercy.
The clean Airbnb Function
cleanAirbnb <- function(tempdf){
NApct = 0.6 #### can be changed #### # see sec 3 in this chunk
# --------------------------------------
# 1. Zero Variance Cols
# Columns Removed:
zvCol <- nearZeroVar(tempdf, saveMetrics = TRUE)
zvNames=rownames(subset(zvCol, nzv== TRUE))
tempdf <- tempdf[ , !(names(tempdf) %in% zvNames)]
print("------ removed due to Zero Variance------")
print(zvNames)
print("--------------------------------------")
# 2. Pattern Matching
# Columns Removed:
pattern <-
colnames(
tempdf %>% select(
starts_with("require"),
starts_with("host"),
starts_with("calendar"),
ends_with("url"),
ends_with("nights")
)
)
tempdf <- tempdf[,!(names(tempdf) %in% pattern)]
print("------Columns removed due to defined patterns------")
print(pattern)
print("--------------------------------------")
# 3. Based on NA Values
# Columns with over 60% NA values are removed without mercy.
# Columns Removed:
nadf <- tempdf %>% summarise_all(funs(sum(is.na(.))))
nadf <- nadf %>% gather(key = var_name, value = value, 1:ncol(nadf))
nadf$numNa <- round(nadf$value/nrow(tempdf),2)
naval <- nadf %>% filter(numNa > NApct) %>% pull(var_name)
tempdf <- tempdf[,!(names(tempdf) %in% naval)]
print("------Columns removed due to more than 60% NA values------")
print(naval)
print("--------------------------------------")
return(tempdf)
}

Call the cleanAirbnb function and see which columns have been removed.
## [1] "------ removed due to Zero Variance------"
## [1] "scrape_id"
## [2] "experiences_offered"
## [3] "thumbnail_url"
## [4] "medium_url"
## [5] "xl_picture_url"
## [6] "host_acceptance_rate"
## [7] "host_has_profile_pic"
## [8] "state"
## [9] "market"
## [10] "country_code"
## [11] "country"
## [12] "bedrooms"
## [13] "bed_type"
## [14] "has_availability"
## [15] "requires_license"
## [16] "license"
## [17] "jurisdiction_names"
## [18] "is_business_travel_ready"
## [19] "require_guest_profile_picture"
## [20] "require_guest_phone_verification"
## [21] "calculated_host_listings_count_shared_rooms"
## [1] "--------------------------------------"
## [1] "------Columns removed due to defined patterns------"
## [1] "host_id" "host_url"
## [3] "host_name" "host_since"
## [5] "host_location" "host_about"
## [7] "host_response_time" "host_response_rate"
## [9] "host_is_superhost" "host_thumbnail_url"
## [11] "host_picture_url" "host_neighbourhood"
## [13] "host_listings_count" "host_total_listings_count"
## [15] "host_verifications" "host_identity_verified"
## [17] "calendar_updated" "calendar_last_scraped"
## [19] "listing_url" "picture_url"
## [21] "minimum_nights" "maximum_nights"
## [23] "minimum_minimum_nights" "maximum_minimum_nights"
## [25] "minimum_maximum_nights" "maximum_maximum_nights"
## [1] "--------------------------------------"
## [1] "------Columns removed due to more than 60% NA values------"
## [1] "square_feet" "weekly_price" "monthly_price"
## [1] "--------------------------------------"
One unexpected removal is the square_feet column, with 48487 NA values, approximately 99.2% of all records. Since this information is usually important for real estate investment, external data could be included if necessary. For this case, we will simply leave it out of consideration.
Character columns with near 100% variance (every row is different) are removed, as they provide no group-level information that can be used on a larger population/scale. These include textual columns describing the home, host, amenities, etc. If other variables prove inconclusive, these columns can be revisited for sentiment analysis.
Other columns that provide no specific benefit for our profitability analysis in this case are also removed.
dropCol <-
c(
'name', 'summary', 'space', 'description', 'neighborhood_overview', 'notes',
'transit', 'access', 'interaction', 'house_rules', 'amenities',
'street', 'city', 'smart_location', 'is_location_exact', 'security_deposit',
'last_scraped', 'accommodates','neighbourhood','beds','bathrooms',
#'cleaning_fee',
'guests_included', 'extra_people', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm',
'number_of_reviews_ltm', 'review_scores_communication', 'review_scores_checkin',
'review_scores_cleanliness', 'review_scores_value', 'reviews_per_month',
'review_scores_accuracy', 'review_scores_location',
'instant_bookable', 'cancellation_policy', 'is_location_exact',
'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes',
'calculated_host_listings_count_private_rooms'
)
airbnb <- airbnb[,!(names(airbnb) %in% dropCol)]

As the last step in data cleaning, let’s double-check whether there are any duplicated rows.
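The check itself is not shown in the original; a base-R version of the idea, on a hypothetical toy frame, might be:

```r
# duplicated() flags rows identical to an earlier row;
# a sum of 0 would mean the data contains no fully duplicated rows
toy <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))
sum(duplicated(toy)) # 1 duplicate row in this toy example
```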
No duplicated rows have been found in the cleaned airbnb dataset.
The majority of columns in the Zillow dataset are numeric, recording the median price of 2-bedroom properties monthly between 1996 and 2017.
The median price for the early years (1996-2013) has plenty of Nulls, as shown in Appendix 2, and this is not consistent across all RegionNames. Steps are taken in the next section to filter out columns with a high percentage of Nulls/NAs.
costCharCol <- colnames(zillow %>% ungroup() %>% select_if(is.character))
costZeroNeg <-
sapply(zillow[, !(names(zillow) %in% costCharCol)], function(x)
count(x <= 0, na.rm = TRUE))
table(costZeroNeg)

## costZeroNeg
## 0
## 262
There are no negative or zero inputs in any of the non-character (int/numeric) columns, as shown in the output table.
The Zillow dataset only provides cost information as late as June 2017, while the Airbnb dataset is two years ahead. This discrepancy between cost and revenue would compromise the integrity of our analysis, so we use time series forecasting (ARIMA in this case) to predict the latest prices.
We wrap this predictive model in a function called cleanZillow, whose inputs are the Zillow dataset and the city where we want to buy the property. The steps performed in the function are as follows:
Select only the relevant columns: RegionName (which denotes the zipcode), City, SizeRank, and the monthly price columns used to predict the cost of the property in July 2019.
Filter the data to include only those rows belonging to the city provided as a function argument, in our case New York.
Due to shortage of time, we assume that there is seasonality in the price and that values depend not only on previous values (autoregressive, AR) but also on differences between previous values (moving average, MA).
Note: we apply an ARIMA model, with the assumption of seasonality, to predict the cost of properties in each zipcode from July 2017 to July 2019. We then attach the price of the property in July 2019 (calculated at the zipcode level) as a new column named cost.
The clean Zillow Function
cleanZillow <- function(tempdf,cityName){ # Zillow data and New york city given as function arguement
tempdf <- tempdf[,c(2,3,7,226:262)] # Select only relevant columns
tempdf <- filter(tempdf,City==cityName) # Filter for the required city
colnames(tempdf)[colnames(tempdf)=="RegionName"] <- "zipcode" # Set proper column name to be used for merging later
tempdf$cost <- NA # Initialize a column to store the predicted price for July 2019
# we define a for loop to iterate over each zipcode to obtain latest cost of property
for(i in 1:nrow(tempdf)){
tmp = ts(as.vector(t(tempdf[,c(4:40)])[,i]),start = c(2014,6),frequency = 12) # Convert the monthly cost data into time series data
ARIMAfit = arima(tmp, order=c(1,1,1), seasonal=list(order=c(1,0,1),period=NA),
method="ML")# Define ARIMA model to be used for prediction
pred = predict(ARIMAfit, n.ahead = 25)# use the ARIMA model to predict the price from July 2017 to July 2019
#### n.ahead to be changed ####
predval <- pred$pred # Store the predicted values in a variable
tempdf$cost[i] <- predval[length(predval)] # set the value of current price for the specific zipcode as price in July 2019
}
return(tempdf[,c(1,2,3,41)]) # return the filtered columns containing only zipcode, City, SizeRank, and cost
}

We call the above function, passing the available Zillow data and the city name "New York".
city="New York"
zillow <- cleanZillow(zillow,city) # call cleanZillow function
data.frame(zillow) # view the structure of the clean and filtered Zillow data

The final clean data contains 4 columns, i.e. zipcode, City, SizeRank, and the current price (cost) of property in the particular zipcode. There are 25 rows, where each row describes a unique zipcode.
As the last step in data cleaning, let’s double-check whether there are any duplicated rows.
No duplicated rows have been found in the cleaned zillow dataset.
The final dataset, containing both revenue and cost data, is merged on the zipcodes common to the cleaned airbnb and zillow datasets.
## [1] 1577 21
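The join call itself is not shown above; a minimal base-R sketch of the idea, on hypothetical mini versions of the two datasets, could look like:

```r
# Inner join on zipcode keeps only zipcodes present in both datasets,
# which is why the merged data shrinks to the zipcodes Zillow covers
revenue_toy <- data.frame(zipcode = c("10003", "10011", "11201"),
                          price   = c(285, 199, 150))
cost_toy    <- data.frame(zipcode = c("10003", "10011"),
                          cost    = c(2214644, 2389288))
final_toy   <- merge(revenue_toy, cost_toy, by = "zipcode")
nrow(final_toy) # 2 rows: 11201 has no Zillow cost and is dropped
```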
Let’s have a preview of the merged dataset final.
## Observations: 1,577
## Variables: 21
## $ zipcode <fct> 10003, 10003, 10003, 10003, 10003, …
## $ id <fct> 31352270, 18201247, 8595349, 428619…
## $ neighbourhood_cleansed <fct> NoHo, East Village, East Village, E…
## $ neighbourhood_group_cleansed <fct> Manhattan, Manhattan, Manhattan, Ma…
## $ latitude <dbl> 40.72811, 40.73218, 40.72756, 40.73…
## $ longitude <dbl> -73.99381, -73.98815, -73.98588, -7…
## $ property_type <fct> Loft, Apartment, Apartment, Apartme…
## $ room_type <fct> Entire home/apt, Entire home/apt, E…
## $ price <dbl> 525, 285, 280, 199, 100, 999, 250, …
## $ cleaning_fee <dbl> 120, 150, 50, 0, 50, 200, 95, 25, 1…
## $ availability_30 <int> 13, 29, 0, 4, 0, 23, 0, 0, 4, 29, 1…
## $ availability_60 <int> 43, 59, 0, 5, 0, 44, 13, 0, 34, 59,…
## $ availability_90 <int> 63, 89, 0, 7, 0, 74, 13, 0, 35, 89,…
## $ availability_365 <int> 63, 364, 0, 232, 0, 164, 13, 0, 197…
## $ number_of_reviews <int> 5, 0, 2, 8, 9, 11, 20, 11, 27, 0, 0…
## $ first_review <fct> 2019-04-15, NA, 2016-01-02, 2016-02…
## $ last_review <fct> 2019-06-02, NA, 2016-05-28, 2018-12…
## $ review_scores_rating <int> 64, NA, 80, 90, 93, 100, 93, 98, 92…
## $ City <fct> New York, New York, New York, New Y…
## $ SizeRank <int> 21, 21, 21, 21, 21, 21, 21, 21, 21,…
## $ cost <dbl> 2214644, 2214644, 2214644, 2214644,…
Upon inspecting the matching between zipcode and neighbourhood group, we find that 10013 has been assigned to both Manhattan and Brooklyn, which should not have happened.
After research, it is clear that 10013 belongs only to Manhattan, so we make the correction here.
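The correction itself is not shown in the original; one way to do it, sketched on a hypothetical mini version of the merged data, would be:

```r
# Toy frame reproducing the mislabeling described above
final_toy <- data.frame(zipcode = c("10013", "10013"),
                        neighbourhood_group_cleansed = c("Manhattan", "Brooklyn"),
                        stringsAsFactors = FALSE)
# Force every 10013 record to Manhattan
final_toy$neighbourhood_group_cleansed[final_toy$zipcode == "10013"] <- "Manhattan"
unique(final_toy$neighbourhood_group_cleansed) # "Manhattan" only
```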
For those 2-bedroom listings that offer private rooms, the daily rental price in the revenue data reflects only the space offered, not the entire property. The price must be corrected to account for the entire property.
Assumption made: for properties that rent out the space as private rooms, we assume both rooms are booked at the same time, so the occupancy rate is the same as for a regular entire apartment. If the room type is Private Room, the price is multiplied by the number of bedrooms to obtain the price of the whole property. The correction is written back to the original price column.
final <- final %>% mutate(price = if_else(room_type == "Private room",
price * nBedrooms, # nBedrooms is predefined to be 2 in this case
price))

Next, we create a function to normalize certain columns onto a 0-1 scale, based on the min-max values of the column, so they can contribute value to the analysis in the following section.
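The normalize helper called below is not defined in the text; a minimal min-max implementation consistent with its usage would be:

```r
# Min-max scaling: maps the smallest value to 0 and the largest to 1;
# na.rm = TRUE lets the function tolerate missing entries
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
normalize(c(0, 5, 10)) # 0.0 0.5 1.0
```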
First, we will normalize the number_of_reviews column.
final["normalized_number_of_reviews"] <- lapply(final["number_of_reviews"], normalize)
summary(final$normalized_number_of_reviews)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.002481 0.009926 0.049373 0.042184 1.000000
This summary indicates that a very small number of popular properties get lots of reviews, while the majority get only a handful.
Let’s list the NA counts and percentages in the final dataset before further feature engineering and analysis.
# Check number of NA in all the columns
missingValues <- as.data.frame(colSums(sapply(final,is.na)))
# Convert rownames to columns
missingValues <- as.data.frame(setDT(missingValues, keep.rownames = TRUE))
# Rename the column names
colnames(missingValues) <- c("columnName","totalNA_values")
# Transform totalNA to percent, add it as column and arrange in descending order on the basis of it
missingValues <- missingValues %>%
mutate_at(vars(totalNA_values),funs(percentNA_values=.*100/nrow(final))) %>%
arrange(desc(percentNA_values))
# Check the top columns having maximum NA values
data.frame(missingValues)

Although there are still some NA values in the final dataset (in review_scores_rating, first_review, last_review, and cleaning_fee), they will not significantly impair our further analysis. If necessary in the future, we could consider imputing the NA values in the cleaning_fee column, while the other three columns tend to be unimputable. We store the aggregate information in another dataset, final_sum.
# Find average of the required columns
avg_df <- final %>%
group_by(zipcode) %>%
summarise_at(vars(price:cost),mean) # mean of current price and other price attributes
# Find unique id (no. of properties) for each zipcode
unique_num_df <- final %>% select(zipcode,id) %>%
group_by(zipcode) %>%
mutate(unique_num = n_distinct(id)) %>%
select(zipcode,unique_num) %>% distinct() # count number of properties in each zipcode as unique_id
unique_nb <- final %>%
select(zipcode, neighbourhood_group_cleansed) %>%
group_by(zipcode) %>%
select(zipcode, neighbourhood_group_cleansed) %>%
distinct()
# Combine both the dataframes to get final summary of the data
final_sum <- inner_join(avg_df, unique_num_df, by = "zipcode") # combine two agregations using zipcode
final_sum <- distinct(inner_join(final_sum, unique_nb,
by = "zipcode")) # Append neighbourhood_group_cleansed to final_sum
summary(final_sum)

## zipcode price cleaning_fee availability_30
## 10003 : 1 Min. : 85.0 Min. : 30.00 Min. : 2.000
## 10011 : 1 1st Qu.:147.1 1st Qu.: 41.25 1st Qu.: 4.970
## 10013 : 1 Median :246.6 Median : 52.50 Median : 6.466
## 10014 : 1 Mean :237.0 Mean : 58.75 Mean : 9.453
## 10021 : 1 3rd Qu.:327.6 3rd Qu.: 70.00 3rd Qu.:13.089
## 10022 : 1 Max. :412.5 Max. :100.00 Max. :29.500
## (Other):18 NA's :20
## availability_60 availability_90 availability_365 number_of_reviews
## Min. : 9.00 Min. :14.70 Min. : 63.00 Min. : 5.014
## 1st Qu.:13.96 1st Qu.:22.47 1st Qu.: 94.27 1st Qu.:13.695
## Median :18.15 Median :28.93 Median :113.50 Median :19.257
## Mean :22.87 Mean :37.19 Mean :138.49 Mean :20.521
## 3rd Qu.:30.46 3rd Qu.:51.73 3rd Qu.:164.44 3rd Qu.:25.852
## Max. :56.00 Max. :86.00 Max. :360.50 Max. :43.000
##
## first_review last_review review_scores_rating City
## Min. : NA Min. : NA Min. : 89.00 Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: 91.69 1st Qu.: NA
## Median : NA Median : NA Median : 92.38 Median : NA
## Mean :NaN Mean :NaN Mean : 93.90 Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: 96.75 3rd Qu.: NA
## Max. : NA Max. : NA Max. :100.00 Max. : NA
## NA's :24 NA's :24 NA's :18 NA's :24
## SizeRank cost unique_num
## Min. : 1.0 Min. : 370164 Min. : 1.00
## 1st Qu.: 47.0 1st Qu.: 448856 1st Qu.: 7.75
## Median : 479.5 Median :1378655 Median : 72.50
## Mean :1056.6 Mean :1362156 Mean : 65.71
## 3rd Qu.:1762.2 3rd Qu.:2059063 3rd Qu.:105.50
## Max. :4647.0 Max. :3189236 Max. :189.00
##
## neighbourhood_group_cleansed
## Bronx : 0
## Brooklyn : 5
## Manhattan :11
## Queens : 1
## Staten Island: 7
##
##
From the summary above, we know that there are 24 zipcodes in New York City that have such rental properties listed on Airbnb, with at least 1 and at most 189 such properties per zipcode.
Given the business goal of finding the zipcodes whose rental properties can generate the most short-term profit, it is necessary to have metrics measuring the performance and short-term potential of those zipcodes in NYC. We will look at the revenue on a monthly, quarterly, and yearly basis, with the assumption of a 75% occupancy rate at all times.
Besides, we will refer to the two most important financial ratios in the rental property business, the Capitalization Rate and the Rent Rate. In this case, they are calculated as:
\[{Cap\;Rate} = \frac{Annual\;Revenue}{Total\;Investment\;on\;Property}\]
\[{Rent\;Rate} = \frac{Monthly\;Revenue}{Total\;Investment\;on\;Property}\]
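As a quick sanity check of these formulas, with hypothetical numbers (a single listing at $200/night, 75% occupancy, bought for $1,000,000):

```r
# Worked example with made-up values; the data-set-wide version follows below
occupancy_rate <- 0.75
nightly_price  <- 200
property_cost  <- 1e6
annual_revenue  <- occupancy_rate * 365 * nightly_price  # 54750
monthly_revenue <- occupancy_rate * 30  * nightly_price  # 4500
cap_rate  <- annual_revenue  / property_cost             # 0.05475, i.e. ~5.5%
rent_rate <- monthly_revenue / property_cost             # 0.0045
```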
# Create variables and assign value
occupancy_rate <- .75
month_days <- 30
quarter_days <- 90 # number of days in a quarter
semiyear_days <- 180 # number of days in half a year
year_days <- 365 # number of days in a year
#Sum the total cost in each zipcode by multiplying the avg price and the num of listings
final_sum$Total_cost <- final_sum$cost*final_sum$unique_num
# Revenue Calculation
# Generate the monthly revenue
final_sum$Revenue_month <-
occupancy_rate*month_days*(final_sum$price)*final_sum$unique_num
# Generate the quarterly revenue
final_sum$Revenue_quarter <-
occupancy_rate*quarter_days*(final_sum$price)*final_sum$unique_num
# Generate the semiyearly revenue
final_sum$Revenue_semiyear <-
occupancy_rate*semiyear_days*(final_sum$price)*final_sum$unique_num
# Generate the annual revenue
final_sum$Revenue_year <-
occupancy_rate*year_days*(final_sum$price)*final_sum$unique_num
## Financial Ratios
# Rent Ratio = Monthly Rent / Cost of the Property
final_sum$Rent_Rate <- final_sum$Revenue_month / final_sum$Total_cost
# Capitalization Rate = Annual Revenue / Total Cost (annual revenue stands in for net operating income, per assumption 1)
final_sum$Cap_Rate = final_sum$Revenue_year / final_sum$Total_cost
#final_sum$Cap_Rate_pct = as.numeric(percent(final_sum$Revenue_year / final_sum$Total_cost))
# derive the percent of properties listed for the given zipcode
final_sum$Percent_units <- final_sum$unique_num*100/sum(final_sum$unique_num)

In this section, we will look at the dataset from a general perspective at the neighbourhood level and then dive deeper into those zipcodes that have more than 10 properties listed, to find the ones that ultimately match our business goal in terms of three metrics: number of properties, revenue, and Cap Rate. Once the top zipcodes are chosen, they will be cross-examined in three tests on property-type diversity, rental availability, and popularity (indicated by the frequency of reviews).
Before we start, let’s create a function that:
1) sorts the results in descending order, and
2) keeps only the top 10 zipcodes among those with more than 10 properties listed on AirBnB.
my_plot <- function(col_name){
v <- enquo(col_name)
n = 10 # select top n values
df_sorted_unique_num <- arrange(final_sum[final_sum$unique_num > 10,],
desc(!!v)) [1:n,]
# order data and keep only those zipcodes with more than 10 properties
# reassign factor levels
df_sorted_unique_num$zipcode <- factor(df_sorted_unique_num$zipcode)
# Return zipcodes
return(df_sorted_unique_num)
}

We create a bubble plot to see the relationship among total investment (x-axis), annual revenue (y-axis), and Cap Rate (bubble size), with neighbourhood indicated by color.
plot_ly(final_sum, y = ~Revenue_year, x = ~Total_cost, text = ~zipcode,
type = 'scatter', mode = 'markers', size = ~Cap_Rate,
color = ~neighbourhood_group_cleansed,
marker = list(opacity = 0.8, sizemode = 'diameter')) %>%
layout(title = 'Annual Revenue vs Total Cost in each Zipcode',
xaxis = list(showgrid = T),
yaxis = list(showgrid = T),
showlegend = T) %>%
layout(legend = list(x = 100, y = 0.5)) # %>% # uncomment code below for pdf file

In general, when the cost of property is high, the annual revenue is also high. However, this does not mean those properties also have a considerable Cap Rate. For example, 10035 has the highest revenue, about 77M, and a medium-high cost of about 1.79B, yet only a 4.3% Cap Rate (small bubble); 10309, on the other hand, has the highest Cap Rate, 23%, but only 1.95M in revenue and 451K in investment. This reflects the trade-off among revenue, investment cost, and Cap Rate. It is in accordance with investment reality, but it does not explicitly tell us which zipcodes are the best to invest in for short-term profit.
Let’s have an overview of the number of properties listed in each zipcode and, in aggregate, in each neighbourhood.
ggplot(final, aes(zipcode, fill = neighbourhood_group_cleansed)) +
geom_bar() + labs(x = "Zipcode", y = "Number of Properties") +
geom_text(stat='count', aes(label=..count..), vjust= -0.3) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme_bw() +
theme(plot.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(fill = "Neighbourhood") +
ggtitle("Number of Properties by Zipcode and Neighbourhood")

We can see that Manhattan and Brooklyn have significantly more properties than Queens and Staten Island.
Next, we compare the properties in each zipcode to the total properties in NYC, using both counts and percentages.
p1 <- my_plot(Percent_units)
# call the ordering and filtering function using Percent units as key metric
p1$zipcode <- factor(p1$zipcode,
levels = unique(p1$zipcode)[order(p1$unique_num,
decreasing = TRUE)])
p1 %>% plot_ly() %>%
add_trace(x=~zipcode, y=~unique_num, type = 'bar', name = 'Number',
marker = list(color = c('green','grey','grey','grey','grey',
'grey','grey','grey','grey','grey'))) %>%
add_trace(x=~zipcode, y=~Percent_units, type = 'scatter', mode = 'lines',
name = 'Percentage', yaxis = 'y2', line=list(color = '#45171D') ) %>%
layout(title = "Top zipcodes with the most properties",
xaxis = list(title = "Zipcodes"),
yaxis = list(side = 'left', title = "Number of Properties"),
yaxis2 = list(side = 'right', overlaying = 'y', title = "Percentage Over All"),
showlegend = T) %>%
layout(legend = list(x = 0.8, y = 0.9))
In this part, we show two charts: one for quarterly revenue and one for annual revenue. 10036 generates the most revenue, with 3.5M per quarter and 14.3M per year, while the top 5 zipcodes all exceed 2.5M in quarterly revenue and 11M in annual revenue.
The top 5 zipcodes with the most revenue are: 10036, 10003, 10013, 10025, 10011
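The revenue figures rest on the client's 75% occupancy assumption. A minimal sketch of how annual revenue per zipcode could be derived from nightly prices under that assumption (the function name and aggregation are illustrative; the report's exact computation is not shown here):

```r
# Occupancy rate assumed by the client's brief
occupancy <- 0.75

# Hypothetical sketch: annual revenue for a zipcode as the sum of
# nightly prices across its listings, scaled by occupied nights
revenue_year <- function(price_per_night, n_nights = 365, occ = occupancy) {
  sum(price_per_night) * n_nights * occ
}

# e.g. three listings at $300, $350, and $325 per night
revenue_year(c(300, 350, 325))
```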
p2 <- my_plot(Revenue_quarter) # call the ordering and filtering function using quarterly revenue as key metric
p2$zipcode <- factor(p2$zipcode,
levels = unique(p2$zipcode)[order(p2$Revenue_quarter,
decreasing = TRUE)])
# plot bar graph for quarter 1
p2 %>% plot_ly( x = ~zipcode, y = ~Revenue_quarter, type = 'bar',
marker = list(color = c('green','grey','grey','grey','grey',
'grey','grey','grey','grey','grey'))) %>%
layout(title = "Top zipcodes by Revenue in a Quarter",
xaxis = list(title = "Zipcode"),
yaxis = list(title = "Revenue"))
p3 <- my_plot(Revenue_year) # call the ordering and filtering function using annual revenue as key metric
p3$zipcode <- factor(p3$zipcode,
levels = unique(p3$zipcode)[order(p3$Revenue_year,
decreasing = TRUE)])
# plot bar graph
p3 %>% plot_ly( x = ~zipcode, y = ~Revenue_year, type = 'bar',
marker = list(color = c('green','grey','grey','grey','grey',
'grey','grey','grey','grey','grey'))) %>%
layout(title = "Top Zipcodes by Revenue in a Year",
xaxis = list(title = "Zipcode"),
yaxis = list(title = "Revenue"))
The Capitalization Rate is central to rental real estate investment because it evaluates a property based on its current value and its net operating income (here, annual revenue). It gives investors an initial yield on an investment property. A rising cap rate signals rising income relative to price, whereas a falling cap rate generally indicates lower rental income relative to price.
A good cap rate is typically above 4 percent, which is also approximately the average Cap Rate in NYC.
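The Cap Rate used in this report reduces to a simple ratio, since the client's assumptions (all-cash purchase, no mortgage, 0% discount rate) let annual revenue stand in for net operating income. A minimal sketch:

```r
# Cap Rate = annual net operating income / current property value.
# Under the report's assumptions (no mortgage, 0% discount rate),
# annual revenue serves as the income term.
cap_rate <- function(annual_revenue, property_value) {
  annual_revenue / property_value
}

cap_rate(90000, 1500000)  # 0.06, i.e. a 6% cap rate
```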
p4 <- my_plot(Cap_Rate) # call the ordering and filtering function using Cap Rate as key metric
p4$zipcode <- factor(p4$zipcode,
levels = unique(p4$zipcode)[order(p4$Cap_Rate,
decreasing = TRUE)])
# plot bar graph
p4 %>% plot_ly( x = ~zipcode, y = ~Cap_Rate, type = 'bar',
marker = list(color = c('red','grey','grey','grey','grey',
'grey','grey','grey','grey','grey'))) %>%
layout(title = "Top Zipcodes with highest Capitalization Rate",
xaxis = list(title = "Zipcodes"),
yaxis = list(title = "Cap Rate"))
The trend is the same for Rent Rate: the top 10 zipcodes are all around 0.4%, with 11434 the highest at 0.82%.
p5 <- my_plot(Rent_Rate) # call the ordering and filtering function using Rent Rate as key metric
p5$zipcode <- factor(p5$zipcode,
levels = unique(p5$zipcode)[order(p5$Rent_Rate,
decreasing = TRUE)])
# plot bar graph
p5 %>% plot_ly( x = ~zipcode, y = ~Rent_Rate, type = 'bar',
marker = list(color = c('red','grey','grey','grey','grey',
'grey','grey','grey','grey','grey'))) %>%
layout(title = "Top Zipcodes with highest Rent_Ratio",
xaxis = list(title = "Zipcodes"),
yaxis = list(title = "Rent_Ratio"))
The top 5 zipcodes with the highest Cap Rate & Rent Rate are: 11434, 10305, 10025, 10036, 11217
In the above analysis, we gathered three groups of top 5 zipcodes: by number of properties, by revenue, and by Cap Rate. Now we will see how frequently each zipcode has scored in the top 5 across those three perspectives.
As the table above shows, 10025 and 10036 are in the top 5 for all three criteria, while 10003 and 11217 matched two criteria. 10011, 10013, 10305, 11215, and 11434 appeared only once.
We will take 10025, 10036, 10003, and 11217 as the top zipcodes for now and further inspect their property diversity, rental availability, and popularity in the following analysis.
After choosing the four top zipcodes above, we run three more checks on their property diversity, rental occupancy, and online popularity.
We should also review property types to ensure the rental properties listed are diversified, so that no single type of property is oversupplied.
propTypeCost <- ggplot(final, aes(x=property_type, y=cost,
fill = neighbourhood_group_cleansed)) +
scale_y_continuous(labels = scales::comma) +
geom_point(aes(col = neighbourhood_group_cleansed, size=cost, text = zipcode)) +
geom_smooth(method="loess", se=F) +
labs( x="Property type", y="Cost") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(legend.position="bottom") +
ggtitle("Property Type vs. Investment Cost in different zipcodes")
ggplotly(propTypeCost) %>%
layout(legend = list(x = 100, y = 0))
All four of our chosen top zipcodes - 10025, 10036, 10003, and 11217 - are fairly diversified, with 10003 offering the most at 7 property types:
10025 has 2 types of properties listed (Apartment and Condominium);
10036 has 4 types listed (Apartment, Condominium, House, and Serviced Apartment);
10003 has 7 types listed (Apartment, Condominium, House, Loft, Other, Serviced Apartment, and Townhouse);
11217 has 6 types listed (Apartment, Condominium, Guest Suite, House, Loft, and Townhouse)
Here we use availability_365 as the indicator of rental availability. Examining the relationship between rental availability and the average rental price by zipcode across neighbourhoods tells us how likely we are to find a popular property that can sustain high occupancy and generate the most profit per stay.
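availability_365 counts the days a listing is still open for booking over the next year, so a rough occupancy proxy can be derived by treating unavailable days as booked. This is an assumption, not the report's stated method: hosts may also block dates for reasons other than bookings. A minimal sketch:

```r
# Rough occupancy proxy: days NOT available out of 365 are treated as
# booked. An approximation -- host-blocked dates also reduce
# availability without representing bookings.
implied_occupancy <- function(availability_365) {
  1 - availability_365 / 365
}

implied_occupancy(106.5)  # ~0.71 for a zipcode averaging 106.5 open days
```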
availablityPrice <- final %>%
group_by(neighbourhood_group_cleansed,zipcode) %>%
summarise_all(mean) %>%
ggplot(aes(x=availability_365, y=price)) +
scale_colour_brewer(palette = "Set1") +
geom_point(aes(col=neighbourhood_group_cleansed, size=price, text=zipcode)) +
labs(x="Availability", y="Price", colour= "Neighbourhood") +
ggtitle("Rental Availability vs. Rental Price in Different Neighbourhood")
ggplotly(availablityPrice) %>%
layout(legend = list(y = 1, x = 0.8))
Three of our chosen top zipcodes - 10025, 10003, and 11217 - show a high occupancy rate, while 10036 only has a medium performance.
Popularity is measured by the number of reviews, and it is a function of time. We intuitively assume that more reviews over a longer listing time equate to popular properties and a popular neighbourhood/zipcode.
The most popular zipcodes by this measure are plotted below.
numDayperReview <- final %>%
mutate(diff = round(difftime(last_review, first_review, units = "weeks"),0)) %>%
drop_na(diff) %>%
group_by(neighbourhood_group_cleansed,zipcode) %>%
summarise_all(mean) %>%
ggplot(aes(x = diff, y= number_of_reviews, text= zipcode,
size=neighbourhood_group_cleansed,
col = neighbourhood_group_cleansed)) +
scale_size_manual(values=c(5,3,4,2)) +
geom_point() + labs(x="Number of Weeks", y="Number Of Reviews") +
ggtitle("Popularity Measured by Number of Reviews vs. Number of Weeks")
ggplotly(numDayperReview) %>%
layout(legend = list(x = 0, y = 1))
Among the top 4 zipcodes we have chosen, 10036 performs best: it has a relatively high number of reviews even though its listing time is the shortest of the four. See below:
10025 has been listed for more than 74 weeks and has received about 19 reviews;
10003 has been listed for more than 96 weeks and has received about 32 reviews;
11217 has been listed for more than 97 weeks and has received about 27 reviews;
10036 has been listed for more than 66 weeks and has received about 27 reviews.
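The claim that 10036 is the most popular despite the shortest listing time can be checked by normalizing reviews by weeks listed, using the rounded figures quoted above:

```r
# Reviews per week for the four shortlisted zipcodes, from the
# rounded figures quoted above
zips    <- c("10025", "10003", "11217", "10036")
weeks   <- c(74, 96, 97, 66)
reviews <- c(19, 32, 27, 27)

rate <- reviews / weeks
names(rate) <- zips
round(rate, 2)
# 10036 accumulates reviews fastest (~0.41 per week), supporting the
# popularity claim despite its shorter listing time
```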
Based on the analysis of the number of properties, the amount of revenue, and the Cap Rate, we first gathered three groups of top 5 zipcodes. We then chose 4 zipcodes for their strong performance across those three metrics: 10003, 10025, and 10036 in Manhattan, and 11217 in Brooklyn. 10025 and 10036 ranked top 5 on all three metrics, while 10003 and 11217 ranked top 5 on two of the three.
All of them performed well in the subsequent tests on property diversity, rental availability, and online popularity, but none stood out distinctly across those three tests.
A detailed comparison is provided below:
##                                       X1          X2          X3          X4
## zipcode                            10003       10025       10036       11217
## neighbourhood_group_cleansed   Manhattan   Manhattan   Manhattan    Brooklyn
## unique_num                           138         126         149         125
## price                           325.8913    332.7143    351.2215    243.8960
## cost                             2214644     1491193     1731224     1322669
## Rent_Rate                    0.003310941 0.005020189 0.004564680 0.004148928
## Cap_Rate                      0.04028311  0.06107896  0.05553694  0.05047863
## Revenue_quarter                  3035678     2829735     3532410     2057872
## Revenue_semiyear                 6071355     5659470     7064820     4115745
## Revenue_year                    12311359    11476148    14325885     8345816
## availability_365               106.51449    96.54762   163.36242    70.85600
## number_of_reviews               23.60145    14.53175    19.28188    22.36800
For this analytics project, the objective was to identify the top zipcodes for our real estate client, who wants to invest in two-bedroom rental properties in New York City with the goal of generating considerable short-term gains. First, pre-obtained datasets from AirBnB and Zillow were pre-processed, cleaned, and aggregated at the zipcode level. Then, they were filtered to reduce redundancy and merged by zipcode into a single data table containing all necessary information. After that, Exploratory Data Analysis was conducted using interactive graphs and visuals to determine which zipcodes are the most rewarding.
As for the result, we suggest that 10025 and 10036 are the two best zipcodes to invest in, because they both have a substantial number of properties listed and a high potential return on investment in terms of revenue and Cap Rate. Moreover, both perform fairly well on property diversity, rental occupancy, and online popularity. Strong performance across those metrics and tests indicates that these two zipcodes are the most likely to meet our business goal of generating short-term gains. 10003 and 11217 should also be considered carefully, because they show a similar performance and a recognizable advantage over the other zipcodes.
There are also several changes that could make our analysis even more thorough in the future:
In the final datasets, some of the NA values could be imputed if necessary.
Square footage data is largely missing here but could be useful when making a real estate investment decision. External data sources can be considered if necessary.
Although this analysis has a business context of making short-term profit, longer-term profitability could still be considered as a reference.
Investment cost could be predicted more accurately, and the possibility of property appreciation taken into account.
The metrics for rental availability and popularity can be refined to better reflect performance.
An investment portfolio could be formed by finding the best combination of property settings across various zipcodes.
numMissingVal_airbnb <- sapply(airbnb, function(x) sum(is.na(x)))
data.frame(numMissingVal_airbnb)
numMissingVal_zillow <- sapply(zillow, function(x) sum(is.na(x)))
data.frame(numMissingVal_zillow)